{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dealing with huge datasets\n",
"\n",
"Some data is too large for our small laptop. Some data is even too large for our terrabyte servers. Being smart about loading data can help us build better data science pipelines.\n",
"\n",
"As data science and machine learning become increasingly popular, the size of data sets used for analysis has grown exponentially. Sometimes, the data is so large that it cannot be loaded into memory on a single machine, even on a terabyte server. \n",
"\n",
"In such cases, loading data can become a bottleneck, hindering data analysis and decision-making. To overcome this challenge, data scientists must be innovative about loading data and building efficient data pipelines. Various techniques and tools are available to process and analyze large datasets, including parallel computing, distributed systems, cloud computing, and data streaming. \n",
"\n",
"This notebook will discuss the challenges of loading large datasets and explore some best practices for building efficient data science pipelines to handle big data. We will also explore popular tools and techniques for processing and analyzing large datasets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How To"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"random_state = 42"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/housing.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" longitude | \n",
" latitude | \n",
" housing_median_age | \n",
" total_rooms | \n",
" total_bedrooms | \n",
" population | \n",
" households | \n",
" median_income | \n",
" median_house_value | \n",
" ocean_proximity | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -122.23 | \n",
" 37.88 | \n",
" 41.0 | \n",
" 880.0 | \n",
" 129.0 | \n",
" 322.0 | \n",
" 126.0 | \n",
" 8.3252 | \n",
" 452600.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 1 | \n",
" -122.22 | \n",
" 37.86 | \n",
" 21.0 | \n",
" 7099.0 | \n",
" 1106.0 | \n",
" 2401.0 | \n",
" 1138.0 | \n",
" 8.3014 | \n",
" 358500.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 2 | \n",
" -122.24 | \n",
" 37.85 | \n",
" 52.0 | \n",
" 1467.0 | \n",
" 190.0 | \n",
" 496.0 | \n",
" 177.0 | \n",
" 7.2574 | \n",
" 352100.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 3 | \n",
" -122.25 | \n",
" 37.85 | \n",
" 52.0 | \n",
" 1274.0 | \n",
" 235.0 | \n",
" 558.0 | \n",
" 219.0 | \n",
" 5.6431 | \n",
" 341300.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 4 | \n",
" -122.25 | \n",
" 37.85 | \n",
" 52.0 | \n",
" 1627.0 | \n",
" 280.0 | \n",
" 565.0 | \n",
" 259.0 | \n",
" 3.8462 | \n",
" 342200.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" longitude latitude housing_median_age total_rooms total_bedrooms \\\n",
"0 -122.23 37.88 41.0 880.0 129.0 \n",
"1 -122.22 37.86 21.0 7099.0 1106.0 \n",
"2 -122.24 37.85 52.0 1467.0 190.0 \n",
"3 -122.25 37.85 52.0 1274.0 235.0 \n",
"4 -122.25 37.85 52.0 1627.0 280.0 \n",
"\n",
" population households median_income median_house_value ocean_proximity \n",
"0 322.0 126.0 8.3252 452600.0 NEAR BAY \n",
"1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n",
"2 496.0 177.0 7.2574 352100.0 NEAR BAY \n",
"3 558.0 219.0 5.6431 341300.0 NEAR BAY \n",
"4 565.0 259.0 3.8462 342200.0 NEAR BAY "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"longitude 165120\n",
"latitude 165120\n",
"housing_median_age 165120\n",
"total_rooms 165120\n",
"total_bedrooms 165120\n",
"population 165120\n",
"households 165120\n",
"median_income 165120\n",
"median_house_value 165120\n",
"ocean_proximity 1342940\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"longitude float64\n",
"latitude float64\n",
"housing_median_age float64\n",
"total_rooms float64\n",
"total_bedrooms float64\n",
"population float64\n",
"households float64\n",
"median_income float64\n",
"median_house_value float64\n",
"ocean_proximity object\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df[\"ocean_proximity\"] = df[\"ocean_proximity\"].astype(\"category\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"longitude 165120\n",
"latitude 165120\n",
"housing_median_age 165120\n",
"total_rooms 165120\n",
"total_bedrooms 165120\n",
"population 165120\n",
"households 165120\n",
"median_income 165120\n",
"median_house_value 165120\n",
"ocean_proximity 21124\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"longitude 165120\n",
"latitude 165120\n",
"housing_median_age 165120\n",
"total_rooms 165120\n",
"total_bedrooms 165120\n",
"population 165120\n",
"households 165120\n",
"median_income 165120\n",
"median_house_value 165120\n",
"ocean_proximity 21124\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_small = pd.read_csv(\"data/housing.csv\", dtype={\"ocean_proximity\": \"category\"})\n",
"df_small.memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" longitude | \n",
" latitude | \n",
" ocean_proximity | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -122.23 | \n",
" 37.88 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 1 | \n",
" -122.22 | \n",
" 37.86 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 2 | \n",
" -122.24 | \n",
" 37.85 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 3 | \n",
" -122.25 | \n",
" 37.85 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 4 | \n",
" -122.25 | \n",
" 37.85 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" longitude latitude ocean_proximity\n",
"0 -122.23 37.88 NEAR BAY\n",
"1 -122.22 37.86 NEAR BAY\n",
"2 -122.24 37.85 NEAR BAY\n",
"3 -122.25 37.85 NEAR BAY\n",
"4 -122.25 37.85 NEAR BAY"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_columns = pd.read_csv(\"data/housing.csv\", usecols=[\"longitude\", \"latitude\", \"ocean_proximity\"])\n",
"df_columns.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" longitude | \n",
" latitude | \n",
" ocean_proximity | \n",
"
\n",
" \n",
" \n",
" \n",
" 20046 | \n",
" -119.01 | \n",
" 36.06 | \n",
" INLAND | \n",
"
\n",
" \n",
" 3024 | \n",
" -119.46 | \n",
" 35.14 | \n",
" INLAND | \n",
"
\n",
" \n",
" 15663 | \n",
" -122.44 | \n",
" 37.80 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 20484 | \n",
" -118.72 | \n",
" 34.28 | \n",
" <1H OCEAN | \n",
"
\n",
" \n",
" 9814 | \n",
" -121.93 | \n",
" 36.62 | \n",
" NEAR OCEAN | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 6052 | \n",
" -117.76 | \n",
" 34.04 | \n",
" INLAND | \n",
"
\n",
" \n",
" 15975 | \n",
" -122.45 | \n",
" 37.77 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 14331 | \n",
" -117.15 | \n",
" 32.72 | \n",
" NEAR OCEAN | \n",
"
\n",
" \n",
" 1606 | \n",
" -122.08 | \n",
" 37.88 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" 10915 | \n",
" -117.87 | \n",
" 33.73 | \n",
" <1H OCEAN | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 3 columns
\n",
"
"
],
"text/plain": [
" longitude latitude ocean_proximity\n",
"20046 -119.01 36.06 INLAND\n",
"3024 -119.46 35.14 INLAND\n",
"15663 -122.44 37.80 NEAR BAY\n",
"20484 -118.72 34.28 <1H OCEAN\n",
"9814 -121.93 36.62 NEAR OCEAN\n",
"... ... ... ...\n",
"6052 -117.76 34.04 INLAND\n",
"15975 -122.45 37.77 NEAR BAY\n",
"14331 -117.15 32.72 NEAR OCEAN\n",
"1606 -122.08 37.88 NEAR BAY\n",
"10915 -117.87 33.73 <1H OCEAN\n",
"\n",
"[100 rows x 3 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_columns.sample(100, random_state=random_state)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"Check out the Dask playground for lazy dataframes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Scaling Pandas to Large Datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html)\n",
"- [Dask](https://dask.org/)\n",
"- [Making Pandas Fly](https://youtu.be/C1hqHk1SfrA?t=4551)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}